Generalized Linear Models - Logistic Regression


Odds, Log Odds, Logit, Logistic


Regression


  • We need to "link" our regression to the Bernoulli family
    • We want to predict a 0 or 1 for each class
    • Y is the probability of being in a particular class
  • Our original linear regression doesn't satisfy these requirements
    • Binary data doesn't follow the normal distribution
    • Probabilities usually aren't linear in the features
    • Its predicted values extend beyond 0 and 1
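In GLM terms, the logit is the canonical link for the Bernoulli family: it connects the linear predictor to a probability (a standard identity, not specific to this notebook):

$$g(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$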

Odds

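For an event with probability $p$, the odds and log odds (standard definitions) are:

$$\text{odds} = \frac{p}{1-p}, \qquad \log\text{-odds} = \log\left(\frac{p}{1-p}\right)$$

For example, $p = 0.75$ gives odds of $0.75 / 0.25 = 3$, i.e. 3:1.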

Logistic

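Inverting the logit gives the logistic function, which maps the linear predictor back into $(0, 1)$:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$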

Sigmoid

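A minimal sketch of the sigmoid $\sigma(z) = 1/(1+e^{-z})$, the S-shaped curve that squashes any real value into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 exactly at z = 0
print(sigmoid(10))   # approaches 1 for large positive z
print(sigmoid(-10))  # approaches 0 for large negative z
```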

Logistic Regression


Setup

In [118]:
import pandas as pd
import numpy as np
import pandas_profiling
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

from sklearn.metrics import auc
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import log_loss
from sklearn.metrics import cohen_kappa_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

import statsmodels.api as sm

from yellowbrick.classifier import ROCAUC
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.features import Manifold

Data

In [48]:
breastCancer = load_breast_cancer()
print(breastCancer.DESCR)
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.
In [49]:
breastCancerDF = pd.DataFrame(breastCancer['data'], columns=breastCancer['feature_names'])
breastCancerDF['TARGET'] = breastCancer['target']
breastCancerDF.head()
Out[49]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension TARGET
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

EDA

In [50]:
pandas_profiling.ProfileReport(breastCancerDF)
Out[50]:

Visualization

T-SNE

In [51]:
X = breastCancerDF.drop('TARGET', axis=1)
y = breastCancerDF['TARGET']
plt.figure(figsize=(15,10))
visualizer = Manifold(manifold='tsne', target='discrete')
visualizer.fit_transform(X, y)
visualizer.poof()

Classification

Correlated Features

  • Remove highly correlated features identified in pandas_profiling
In [52]:
select = [
    #'mean_radius', 
    'mean_texture', 
    #'mean_perimeter', 
    'mean_area',
    'mean_smoothness', 
    'mean_compactness', 
    #'mean_concavity',
    'mean_concave_points', 
    'mean_symmetry', 
    'mean_fractal_dimension',
    #'radius_error', 
    'texture_error', 
    #'perimeter_error', 
    'area_error',
    'smoothness_error', 
    'compactness_error', 
    'concavity_error',
    'concave_points_error', 
    'symmetry_error', 
    'fractal_dimension_error',
    #'worst_radius', 
    #'worst_texture', 
    #'worst_perimeter', 
    #'worst_area',
    'worst_smoothness', 
    'worst_compactness', 
    'worst_concavity',
    #'worst_concave_points', 
    'worst_symmetry', 
    'worst_fractal_dimension'   
]

Test and Train Split

  • Cut into Train & Test -- stratify Target
In [53]:
# the select list uses underscored names; normalize the space-separated sklearn column names first
breastCancerDF.columns = breastCancerDF.columns.str.replace(' ', '_')
X = breastCancerDF[select]
y = breastCancerDF['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1337)

Pipeline

  • Scaling is optional for Logistic Regression (though it can help regularized solvers converge), so the numeric pipeline only imputes
In [54]:
featuresNumeric = X.columns.tolist()
transformerNumeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

featuresCategorical = []
transformerCategorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot',  OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
        ('numeric',     transformerNumeric,     featuresNumeric),
        ('categorical', transformerCategorical, featuresCategorical)
])

GLM

Model

In [55]:
logistic = LogisticRegression()
logPipe = Pipeline([
    ('preprocess', preprocessor),
    ('logistic', logistic)
])
logPipe.fit(X_train, y_train)
Out[55]:
Pipeline(memory=None,
         steps=[('preprocess',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numeric',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0))],
                                                           verbose=False),
                                                  ['mean_texture', 'mean_area...
                                                                                 handle_unknown='ignore',
                                                                                 n_values=None,
                                                                                 sparse=True))],
                                                           verbose=False),
                                                  [])],
                                   verbose=False)),
                ('logistic',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

Predictions

In [56]:
preds = logPipe.predict(X_test)
proba = logPipe.predict_proba(X_test)
print('Score Train {:.3f}'.format(logPipe.score(X_train, y_train)))
print('Score Test  {:.3f}'.format(logPipe.score(X_test, y_test)))
Score Train 0.941
Score Test  0.958
In [57]:
predDF = pd.DataFrame(proba, columns=['Probability False', 'Probability True'])
predDF['Predicted Class'] = preds
predDF.head()
Out[57]:
Probability False Probability True Predicted Class
0 0.999997 0.000003 0
1 0.010556 0.989444 1
2 0.197324 0.802676 1
3 0.999023 0.000977 0
4 0.992236 0.007764 0

Threshold

  • Defaulted to 0.5, but can be set from 0.0 to 1.0 depending on "cost" of error
  • Precision: An increase in precision is a reduction in the number of false positives
  • Recall: An increase in recall decreases the likelihood that we miss a positive class
  • F1 Score: The F1 score is the harmonic mean between precision and recall
  • Queue Rate: This metric describes the percentage of instances that must be reviewed.
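A hedged sketch of moving the threshold by hand, using predict_proba rather than predict; the dataset and split mirror the ones above, but this is a fresh standalone model, not the notebook's pipeline:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1337)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

preds_default = (proba >= 0.5).astype(int)   # what .predict() does
preds_low     = (proba >= 0.2).astype(int)   # favor recall: fewer missed positives

print('Positives flagged at 0.5:', preds_default.sum())
print('Positives flagged at 0.2:', preds_low.sum())
```

Lowering the threshold can only add positive predictions, so recall rises (and precision typically falls).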
In [58]:
plt.figure(figsize=(15,10))
visualizer = DiscriminationThreshold(logPipe)
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.poof()     # Draw/show/poof the data

Errors

Accuracy

  • Proportion of labels we correctly identify
In [92]:
print('Test Accuracy = {:.3f}'.format(logPipe.score(X_test, y_test)))
Test Accuracy = 0.958

Log Loss

  • Common loss objective function
  • Quantifies the accuracy of a classifier by penalizing false classifications
  • Minimizing the Log Loss == Maximizing the classifier accuracy
  • Log Loss penalizes erroneous predictions made with high confidence more than those made with low confidence
    • Class = 1 & Predicted Prob of Class 1 = 0.5: Log Loss = ln(2) ≈ 0.693
    • Class = 1 & Predicted Prob of Class 1 = 0.0: Log Loss == Very Large (approaching +Infinity)
    • Class = 1 & Predicted Prob of Class 1 = 1.0: Log Loss == Very Small (approaching 0)
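The three cases above can be checked directly (sklearn uses the natural log and clips probabilities away from exact 0 and 1):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1]

# confident-and-right, uncertain, confident-and-wrong
for p in (1.0 - 1e-15, 0.5, 1e-15):
    # one sample; `labels` tells log_loss both classes exist
    print(p, log_loss(y_true, [[1 - p, p]], labels=[0, 1]))
```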
In [59]:
# note: `preds` are hard 0/1 labels; pass `proba[:, 1]` instead for the probabilistic log loss
print('Log Loss: {:.3f}'.format(log_loss(y_test, preds)))
Log Loss: 1.449

Confusion Matrix

In [60]:
cm = confusion_matrix(y_test, preds)  # avoid shadowing the imported confusion_matrix function
print(cm)
[[51  2]
 [ 4 86]]

Classification Report

In [61]:
rpt = classification_report(y_test, preds)
print(rpt)
              precision    recall  f1-score   support

           0       0.93      0.96      0.94        53
           1       0.98      0.96      0.97        90

    accuracy                           0.96       143
   macro avg       0.95      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143

AUC - ROC Curve

In [62]:
logit_roc_auc = roc_auc_score(y_test, preds)
fpr, tpr, thresholds = roc_curve(y_test, proba[:,1])
plt.figure(figsize=(15,10))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.style.use('seaborn-whitegrid')
plt.show()

AUC

In [63]:
print('AUC: {:.3f}'.format(logit_roc_auc))
AUC: 0.959

Cohen's Kappa

  • Baseline is found by always guessing the majority class
  • In our example we have 90 positives out of 143 in test (~63%)
  • Intuition: if our accuracy were 0.63, our Cohen's Kappa would be 0
  • We can include weights on the errors we make
$$score = 1 - \frac{(1 - accuracy)}{(1 - baseline)}$$
In [123]:
acc  = logPipe.score(X_test, y_test)
base = y_test.value_counts(normalize=True).max()  # accuracy of always guessing the majority class
print('Accuracy on Test : {:.3f}'.format(acc))
print('Baseline on Test : {:.3f}'.format(base))
print("Cohen's Kappa    : {:.3f}".format(cohen_kappa_score(logPipe.predict(X_test), y_test)))
Accuracy on Test : 0.958
Baseline on Test : 0.629
Cohen's Kappa    : 0.911
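A quick sanity check of the intuition above: a classifier that always guesses the majority class scores a kappa of exactly 0, whatever its raw accuracy (the toy labels below just mirror the 90/53 test split):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([1] * 90 + [0] * 53)   # 90 positives, 53 negatives
y_majority = np.ones_like(y_true)        # always predict the majority class

accuracy = (y_majority == y_true).mean()
print('Accuracy: {:.3f}'.format(accuracy))   # ~0.63, yet no skill
print('Kappa   : {:.3f}'.format(cohen_kappa_score(y_true, y_majority)))
```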

Yellowbrick Visualization

Confusion Matrix

In [64]:
plt.figure(figsize=(8,6))
visualizer = ConfusionMatrix(logPipe, classes=[0,1], percent=True, fontsize=20)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()

Classification Report

In [65]:
plt.figure(figsize=(10,8))
visualizer = ClassificationReport(logPipe, classes=[0,1], support=True)
visualizer.fit(X_train, y_train)  # Fit the visualizer and the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data

ROC Curve

In [66]:
plt.figure(figsize=(15,10))
visualizer = ROCAUC(logPipe, classes=['Malignant', 'Benign'])  # target 0 = malignant, 1 = benign
visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
g = visualizer.poof()             

Class Prediction Error

In [67]:
plt.figure(figsize=(8,6))
visualizer = ClassPredictionError(logPipe, classes=[0,1])
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()

Interpretation

20 Variable Model

In [68]:
model = sm.Logit(y_train, X_train)
result = model.fit()
print(result.summary2())
preds20 = result.predict(X_test)
Optimization terminated successfully.
         Current function value: 0.052181
         Iterations 23
                                Results: Logit
===============================================================================
Model:                    Logit                Pseudo R-squared:     0.921     
Dependent Variable:       TARGET               AIC:                  84.4581   
Date:                     2019-08-14 06:57     BIC:                  165.5469  
No. Observations:         426                  Log-Likelihood:       -22.229   
Df Model:                 19                   LL-Null:              -281.44   
Df Residuals:             406                  LLR p-value:          7.5852e-98
Converged:                1.0000               Scale:                1.0000    
No. Iterations:           23.0000                                              
-------------------------------------------------------------------------------
                          Coef.    Std.Err.    z    P>|z|    [0.025     0.975] 
-------------------------------------------------------------------------------
mean_texture              -0.4457    0.2023 -2.2031 0.0276    -0.8422   -0.0492
mean_area                  0.0118    0.0065  1.8114 0.0701    -0.0010    0.0246
mean_smoothness          209.9882  222.2212  0.9450 0.3447  -225.5574  645.5338
mean_compactness          15.7817   54.7581  0.2882 0.7732   -91.5422  123.1055
mean_concave_points     -305.9473  134.6899 -2.2715 0.0231  -569.9347  -41.9599
mean_symmetry             36.7699   39.5186  0.9304 0.3521   -40.6852  114.2250
mean_fractal_dimension   673.9930  262.0340  2.5722 0.0101   160.4157 1187.5703
texture_error             -0.9892    1.2755 -0.7756 0.4380    -3.4891    1.5107
area_error                -0.2891    0.1090 -2.6520 0.0080    -0.5027   -0.0754
smoothness_error         819.7433 1147.0389  0.7147 0.4748 -1428.4116 3067.8981
compactness_error       -177.3971  256.1095 -0.6927 0.4885  -679.3625  324.5683
concavity_error          159.8302  207.5087  0.7702 0.4412  -246.8793  566.5397
concave_points_error    -208.4239  328.8264 -0.6338 0.5262  -852.9117  436.0640
symmetry_error            12.1492  247.5820  0.0491 0.9609  -473.1026  497.4010
fractal_dimension_error 2434.8311 1784.9822  1.3641 0.1725 -1063.6697 5933.3319
worst_smoothness        -116.5045  152.1608 -0.7657 0.4439  -414.7341  181.7252
worst_compactness         37.6908   32.8276  1.1481 0.2509   -26.6500  102.0316
worst_concavity          -26.5432   24.8584 -1.0678 0.2856   -75.2648   22.1785
worst_symmetry           -37.5570   26.7800 -1.4024 0.1608   -90.0449   14.9309
worst_fractal_dimension -378.2860  198.7443 -1.9034 0.0570  -767.8177   11.2456
===============================================================================

Coefficients

  • Coefficients affect Log Odds - exponentiate to see the change
  • A 1 unit change in a feature changes the Log-Odds of Cancer by its coefficient
  • Coefficient changes are not linear in probability!
  • Exponentiate a coefficient to find the multiplicative change in Odds
  • The sign is instructive (negative is a reduction in odds)
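For example, exponentiating a hypothetical coefficient of 0.7 (the value is chosen for illustration, not taken from the tables above):

```python
import numpy as np

beta = 0.7                 # hypothetical log-odds coefficient
odds_ratio = np.exp(beta)  # multiplicative change in odds per 1-unit increase

print('Odds ratio: {:.2f}'.format(odds_ratio))  # ~2.01: the odds roughly double
```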

Goodness of Fit

  • Deviance
  • Hosmer–Lemeshow test
  • Wald statistic
  • Likelihood ratio test

Observation

  • There are several extremely large coefficients and several with large p-values (intervals including 0).
    • fractal_dimension_error 2,434
    • smoothness_error 819
    • mean_fractal_dimension 673
  • We could be overfitting or including too many features in the model
  • We should look at:
    • Feature selection
    • Regularization

Feature Selection

Scoring Top N Features

In [87]:
metricAUC     = []
metricF1      = []
metricLogLoss = []
for n in range(20):
    rfe = RFE(logistic, n_features_to_select=n + 1, step=1)
    rfe = rfe.fit(X_train, y_train)
    features_bool = np.array(rfe.support_)
    features = np.array(X.columns)
    topFeatures = features[features_bool].tolist()
    
    model = sm.Logit(y_train, X_train[topFeatures])
    result = model.fit()
    preds = result.predict(X_test[topFeatures])

    fpr, tpr, thresholds = roc_curve(y_test, preds, pos_label=1)
    metricAUC.append(auc(fpr, tpr))
    metricF1.append(f1_score(y_test, preds > 0.5))
    metricLogLoss.append(log_loss(y_test, preds))
Optimization terminated successfully.
         Current function value: 0.676617
         Iterations 4
Optimization terminated successfully.
         Current function value: 0.611109
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.570048
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.285709
         Iterations 9
Optimization terminated successfully.
         Current function value: 0.209117
         Iterations 9
Optimization terminated successfully.
         Current function value: 0.196793
         Iterations 9
Optimization terminated successfully.
         Current function value: 0.143605
         Iterations 10
Optimization terminated successfully.
         Current function value: 0.143504
         Iterations 10
Optimization terminated successfully.
         Current function value: 0.143503
         Iterations 10
Optimization terminated successfully.
         Current function value: 0.142554
         Iterations 10
Optimization terminated successfully.
         Current function value: 0.139779
         Iterations 10
Optimization terminated successfully.
         Current function value: 0.113325
         Iterations 11
Optimization terminated successfully.
         Current function value: 0.072982
         Iterations 12
Optimization terminated successfully.
         Current function value: 0.068003
         Iterations 13
Optimization terminated successfully.
         Current function value: 0.063058
         Iterations 13
Optimization terminated successfully.
         Current function value: 0.061358
         Iterations 13
Optimization terminated successfully.
         Current function value: 0.058490
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.058201
         Iterations 16
Optimization terminated successfully.
         Current function value: 0.057354
         Iterations 18
Optimization terminated successfully.
         Current function value: 0.052181
         Iterations 23

Top N AUC

In [70]:
pd.DataFrame(metricAUC, columns=['AUC']).plot(figsize=(15,10))
plt.title('AUC for Top N Features')
plt.ylabel('AUC')
plt.show()

Top N F1-Score

In [71]:
pd.DataFrame(metricF1, columns=['F1 Score']).plot(figsize=(15,10))
plt.title('F1 Score for Top N Features')
plt.ylabel('F1 Score')
plt.show()

Top N Log Loss

In [89]:
pd.DataFrame(metricLogLoss, columns=['LogLoss']).plot(figsize=(15,10))
plt.title('LogLoss for Top N Features')
plt.ylabel('LogLoss')
plt.show()

Top 15 Features

  • Best AUC
  • Lowest LogLoss
In [72]:
logistic = LogisticRegression()
logPipeRFE = Pipeline([
    ('preprocess', preprocessor),
    ('logistic', logistic)
])
In [73]:
rfe = RFE(logistic, n_features_to_select=15, step=1)
rfe = rfe.fit(X_train, y_train)
features_bool = np.array(rfe.support_)
features = np.array(X.columns)
topFeatures = features[features_bool].tolist()
topFeatures
Out[73]:
['mean_texture',
 'mean_smoothness',
 'mean_compactness',
 'mean_concave_points',
 'mean_symmetry',
 'mean_fractal_dimension',
 'texture_error',
 'area_error',
 'compactness_error',
 'concavity_error',
 'symmetry_error',
 'worst_compactness',
 'worst_concavity',
 'worst_symmetry',
 'worst_fractal_dimension']

15 Variable Model

In [74]:
model = sm.Logit(y_train, X_train[topFeatures])
result = model.fit()
print(result.summary2())
preds15 = result.predict(X_test[topFeatures])
Optimization terminated successfully.
         Current function value: 0.063058
         Iterations 13
                               Results: Logit
============================================================================
Model:                   Logit               Pseudo R-squared:    0.905     
Dependent Variable:      TARGET              AIC:                 83.7252   
Date:                    2019-08-14 06:57    BIC:                 144.5418  
No. Observations:        426                 Log-Likelihood:      -26.863   
Df Model:                14                  LL-Null:             -281.44   
Df Residuals:            411                 LLR p-value:         1.0626e-99
Converged:               1.0000              Scale:               1.0000    
No. Iterations:          13.0000                                            
----------------------------------------------------------------------------
                          Coef.   Std.Err.    z    P>|z|    [0.025   0.975] 
----------------------------------------------------------------------------
mean_texture              -0.2339   0.1201 -1.9480 0.0514   -0.4692   0.0014
mean_smoothness          -15.6331  70.0934 -0.2230 0.8235 -153.0136 121.7475
mean_compactness          -5.9160  42.9987 -0.1376 0.8906  -90.1919  78.3598
mean_concave_points     -144.4353  63.5917 -2.2713 0.0231 -269.0728 -19.7977
mean_symmetry             38.9617  32.8300  1.1868 0.2353  -25.3840 103.3074
mean_fractal_dimension   575.4133 178.1194  3.2305 0.0012  226.3056 924.5210
texture_error             -1.3620   1.0249 -1.3288 0.1839   -3.3708   0.6469
area_error                -0.1859   0.0622 -2.9898 0.0028   -0.3078  -0.0640
compactness_error         59.5255 102.0646  0.5832 0.5597 -140.5175 259.5685
concavity_error          112.6429 113.6343  0.9913 0.3216 -110.0764 335.3621
symmetry_error            80.6292 174.4354  0.4622 0.6439 -261.2579 422.5164
worst_compactness         10.1671  18.4709  0.5504 0.5820  -26.0352  46.3693
worst_concavity          -22.7067  14.8800 -1.5260 0.1270  -51.8710   6.4575
worst_symmetry           -38.8267  23.2766 -1.6681 0.0953  -84.4481   6.7946
worst_fractal_dimension -150.4990  77.0565 -1.9531 0.0508 -301.5270   0.5290
============================================================================

Comparison

  • The 15 feature model still includes several features with high p-values
  • Features with high p-values or intervals stretching across zero could potentially be removed
  • Alternatively, we could look to regularize the coefficients
In [75]:
resultDF = pd.DataFrame(columns=['LogLoss', 'AUC'])

20 Features

In [76]:
fpr, tpr, thresholds = roc_curve(y_test, preds20, pos_label=1)
area_under_curve = auc(fpr, tpr)
logloss = log_loss(y_test, preds20)

resultDF.loc['20 Feature Model'] = [logloss, area_under_curve]

print('AUC:      {:.3f}'.format(area_under_curve))
print('Log Loss: {:.3f}'.format(logloss))
AUC:      0.980
Log Loss: 0.282
In [77]:
preds20_class = preds20 > 0.5
print(classification_report(y_test, preds20_class))
              precision    recall  f1-score   support

           0       0.98      0.96      0.97        53
           1       0.98      0.99      0.98        90

    accuracy                           0.98       143
   macro avg       0.98      0.98      0.98       143
weighted avg       0.98      0.98      0.98       143

15 Features

In [78]:
fpr, tpr, thresholds = roc_curve(y_test, preds15, pos_label=1)
area_under_curve = auc(fpr, tpr)
logloss = log_loss(y_test, preds15)

resultDF.loc['15 Feature Model'] = [logloss, area_under_curve]

print('AUC:      {:.3f}'.format(area_under_curve))
print('Log Loss: {:.3f}'.format(logloss))
AUC:      0.988
Log Loss: 0.120
In [79]:
preds15_class = preds15 > 0.5
print(classification_report(y_test, preds15_class))
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        53
           1       0.97      0.98      0.97        90

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.97      0.96       143

Regularization

Grid Search Regularization

In [80]:
logistic = LogisticRegression()
regPipe = Pipeline([
    ('preprocess', preprocessor),
    ('logistic', logistic)
])

param_grid = {
    'logistic__penalty' : ['l1'],
    'logistic__C' : np.logspace(-4, 4, 20)
}

clf = GridSearchCV(regPipe, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    2.5s finished
In [81]:
best_clf.best_params_
Out[81]:
{'logistic__C': 29.763514416313132, 'logistic__penalty': 'l1'}

Regularized Model

In [82]:
model = sm.Logit(y_train, X_train)
# statsmodels' alpha is the penalty weight (roughly 1/C in sklearn terms);
# L1_wt is an elastic-net option for linear models and is ignored by Logit.fit_regularized
result = model.fit_regularized(method='l1', alpha=1.0, L1_wt=29.763)
print(result.summary2())
predsReg = result.predict(X_test)
Optimization terminated successfully.    (Exit mode 0)
            Current function value: 0.30876724962981555
            Iterations: 89
            Function evaluations: 99
            Gradient evaluations: 89
                             Results: Logit
========================================================================
Model:                 Logit              Pseudo R-squared:   0.623     
Dependent Variable:    TARGET             AIC:                224.4079  
Date:                  2019-08-14 06:57   BIC:                248.7345  
No. Observations:      426                Log-Likelihood:     -106.20   
Df Model:              5                  LL-Null:            -281.44   
Df Residuals:          420                LLR p-value:        1.3850e-73
Converged:             1.0000             Scale:              1.0000    
No. Iterations:        89.0000                                          
------------------------------------------------------------------------
                         Coef.  Std.Err.    z    P>|z|   [0.025   0.975]
------------------------------------------------------------------------
mean_texture             0.0330   0.0408  0.8088 0.4187  -0.0470  0.1131
mean_area                0.0000      nan     nan    nan      nan     nan
mean_smoothness          0.0000      nan     nan    nan      nan     nan
mean_compactness         0.0000      nan     nan    nan      nan     nan
mean_concave_points      0.0000      nan     nan    nan      nan     nan
mean_symmetry           16.8160   4.2750  3.9336 0.0001   8.4372 25.1947
mean_fractal_dimension   0.0000      nan     nan    nan      nan     nan
texture_error            1.7585   0.4321  4.0694 0.0000   0.9115  2.6054
area_error              -0.1075   0.0147 -7.3164 0.0000  -0.1362 -0.0787
smoothness_error         0.0000      nan     nan    nan      nan     nan
compactness_error        0.0000      nan     nan    nan      nan     nan
concavity_error          0.0000      nan     nan    nan      nan     nan
concave_points_error     0.0000      nan     nan    nan      nan     nan
symmetry_error           0.0000      nan     nan    nan      nan     nan
fractal_dimension_error  0.0000      nan     nan    nan      nan     nan
worst_smoothness         0.0000      nan     nan    nan      nan     nan
worst_compactness       -0.3068   2.3238 -0.1320 0.8950  -4.8614  4.2477
worst_concavity         -6.4735   1.9787 -3.2716 0.0011 -10.3516 -2.5953
worst_symmetry           0.0000      nan     nan    nan      nan     nan
worst_fractal_dimension  0.0000      nan     nan    nan      nan     nan
========================================================================

Errors

In [83]:
fpr, tpr, thresholds = roc_curve(y_test, predsReg, pos_label=1)
area_under_curve = auc(fpr, tpr)
logloss = log_loss(y_test, predsReg)

resultDF.loc['Regularized Model'] = [logloss, area_under_curve]

print('AUC:      {:.3f}'.format(area_under_curve))
print('Log Loss: {:.3f}'.format(logloss))
AUC:      0.969
Log Loss: 0.233
In [84]:
predsReg_class = predsReg > 0.5
print(classification_report(y_test, predsReg_class))
              precision    recall  f1-score   support

           0       0.89      0.89      0.89        53
           1       0.93      0.93      0.93        90

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

Overall Results

In [86]:
resultDF.sort_values('LogLoss')
Out[86]:
LogLoss AUC
15 Feature Model 0.120154 0.987631
Regularized Model 0.232784 0.968973
20 Feature Model 0.281644 0.980398

Advice & Tips

  • Advantages:
    • Computationally efficient
    • Interpretable
    • Does not require features scaling
    • Does not require tuning
    • Easy to regularize
  • Logistic Regression works best when:
    • Remove features that are highly correlated with eachother
    • Remove features that have little relationship with the dependent variable
  • Downsides
    • Can be easily outperformed by complex models
    • Can’t solve non-linear problems since it has a linear decision surface / boundary
  • Practical Tips
    • Consider feature selection
    • Test Regularization L1 & L2
    • Use statsmodel for interpreation
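A hedged sketch of the L1-vs-L2 tip (C=0.1 and the liblinear solver are illustrative choices, not taken from this notebook): L1 drives some coefficients to exactly zero, acting as built-in feature selection, while L2 only shrinks them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def n_zero_coefs(penalty):
    # scale first: otherwise the penalty is dominated by large-range features
    pipe = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty=penalty, C=0.1, solver='liblinear'))
    pipe.fit(X, y)
    return int((pipe.named_steps['logisticregression'].coef_ == 0).sum())

print('Zeroed coefficients, L1:', n_zero_coefs('l1'))  # several of the 30 features drop out
print('Zeroed coefficients, L2:', n_zero_coefs('l2'))  # coefficients shrink but stay nonzero
```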
In [ ]: